fix: async llm engine didnt have get_metrics()#1943
Conversation
Signed-off-by: Terry Kong <terryk@nvidia.com>
📝 WalkthroughWalkthroughThis PR adds a compatibility layer to the vLLM worker module for retrieving metrics with a fallback mechanism when Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/models/generation/vllm/vllm_worker.py (1)
542-554:⚠️ Potential issue | 🔴 CriticalBug: iterating over empty
metricsdict instead ofvllm_prom_metrics.Line 549 iterates
for metric in metrics, butmetricsis the empty dict initialized on line 540. This meansvllm_prom_metricsis computed but never read, and the method always returns{}. This should iterate overvllm_prom_metrics.🐛 Proposed fix
if hasattr(self.llm, "get_metrics"): vllm_prom_metrics = self.llm.get_metrics() else: # The AsyncLLM API does not implement get_metrics so we need to call the prometheus API ourselves from vllm.v1.metrics.reader import get_metrics_snapshot vllm_prom_metrics = get_metrics_snapshot() - for metric in metrics: + for metric in vllm_prom_metrics: if hasattr(metric, "values"): metrics[metric.name] = metric.values elif hasattr(metric, "value"): metrics[metric.name] = metric.value
🤖 Fix all issues with AI agents
In `@nemo_rl/models/generation/vllm/vllm_worker.py`:
- Around line 544-548: The loop is iterating the wrong variable: replace the
iteration over the accumulating dict `metrics` with the Prometheus snapshot list
`vllm_prom_metrics` so the metric objects from get_metrics_snapshot() are
processed; find the loop in vllm_worker.py (inside the block that imports
get_metrics_snapshot) and change `for metric in metrics:` to `for metric in
vllm_prom_metrics:` and ensure the loop reads metric.name and metric.value /
metric.values as currently expected to populate the `metrics` dict.
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
closes #1934
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Bug Fixes
Tests